## Prerequisites

We will use the Transformers library from Hugging Face, which can be installed with pip:

pip install transformers

You will also need PyTorch, which the code cells below import as `torch`.

## Exercise 1: Tokenization and Embedding Exploration

The aim of this exercise is to visualize how text is broken down into tokens and converted into embeddings. 

1) Create a short ten-word sentence
2) Tokenize it using the tokenizer for the Hugging Face model `bert-base-uncased`
3) Decode the tokens back into words
4) Use the model's embedding layer to project tokens into vectors
5) Visualize the embeddings using PCA

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
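
One possible way through the five steps is sketched below. It is a minimal sketch, not a reference solution: the helper names `pca_2d` and `embed_sentence` are made up for illustration, PCA is done with a plain NumPy SVD rather than a dedicated library, and the vectors come from the model's input embedding layer, so they are context-free token embeddings rather than the model's final hidden states.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def embed_sentence(sentence, model_name="bert-base-uncased"):
    """Hypothetical helper: tokenize, decode, and embed a sentence."""
    # Imported here so that pca_2d can be used without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Step 2: tokenize (special tokens such as [CLS] and [SEP] are added).
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

    # Step 3: decode the token IDs back into text.
    decoded = tokenizer.decode(input_ids[0])

    # Step 4: look up each token ID in the input embedding layer.
    with torch.no_grad():
        vectors = model.get_input_embeddings()(input_ids)[0]

    return tokens, decoded, vectors.numpy()
```

Calling `tokens, decoded, vectors = embed_sentence("your ten-word sentence here")` and then `pca_2d(vectors)` gives one 2-D point per token for step 5. The points can be drawn with a scatter plot, for example with matplotlib, labelling each point with its token.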

## Exercise 2: Build Your Own Scaled Dot-Product Attention

In this exercise you implement the attention mechanism from scratch on small data to become familiar with how it works.

1) Generate small random matrices for queries, keys, and values
2) Implement the scaled dot-product attention:

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V $$

3) Visualize the attention weights as a heatmap
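
The formula above can be sketched in a few lines. This version uses NumPy to keep the example dependency-light (the same operations exist in PyTorch), and the matrix sizes (4 tokens, dimension 8) and the random seed are arbitrary choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    # Each row becomes a probability distribution over the keys.
    weights = softmax(scores, axis=-1)
    return weights @ V, weights


# Step 1: small random matrices (sizes are assumptions for this sketch).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

# Step 2: apply attention.
output, weights = scaled_dot_product_attention(Q, K, V)
```

For step 3, `weights` is a 4 x 4 matrix whose rows sum to 1, so it can be passed directly to a heatmap function such as `matplotlib.pyplot.imshow`, with queries on one axis and keys on the other.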

## Exercise 3: Multi-Head Attention 

In this exercise you implement a simplified version of multi-head attention on synthetic data to see how it works.

Repeat Exercise 2 with a synthetic input of 3 tokens, each with an 8-dimensional embedding, using 3 attention heads.
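
A simplified multi-head version might look like the sketch below. Because 8 is not divisible by 3, the embedding cannot be split evenly across the heads, so this sketch assumes each head has its own random projection matrices into a per-head dimension of 4. That value, the output projection `W_o`, and the function name `multi_head_attention` are illustrative assumptions, not part of the exercise statement.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Self-attention with several heads.

    X:             (n_tokens, d_model)
    W_q, W_k, W_v: (n_heads, d_model, d_head)
    W_o:           (n_heads * d_head, d_model)
    """
    # Matmul broadcasting gives one projection per head:
    # (n_tokens, d_model) @ (n_heads, d_model, d_head) -> (n_heads, n_tokens, d_head)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_head = Q.shape[-1]

    # Scaled dot-product attention, computed independently in each head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)  # (n_heads, n_tokens, n_tokens)
    heads = weights @ V                 # (n_heads, n_tokens, d_head)

    # Concatenate the heads per token, then project back to d_model.
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], -1)
    return concat @ W_o, weights


# Sizes from the exercise; d_head = 4 is an assumption of this sketch.
n_tokens, d_model, n_heads, d_head = 3, 8, 3, 4

rng = np.random.default_rng(0)
X = rng.normal(size=(n_tokens, d_model))
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))
W_o = rng.normal(size=(n_heads * d_head, d_model))

output, weights = multi_head_attention(X, W_q, W_k, W_v, W_o)
```

Here `weights[h]` is the 3 x 3 attention matrix of head `h`, so plotting the three matrices side by side shows how differently initialised heads distribute attention over the same tokens.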

## Exercise 4: Explore Attention on a Sentence

Here we will see how each word in a sentence attends to the others in context.

1) Input a sentence into the DistilBERT model
2) Extract the attention weights from one or more layers
3) Use a heat map to visualize attention across words

Q. In your sentence, which words focus on which others?

Q. How does this vary between layers?

In [None]:
from transformers import DistilBertModel, DistilBertTokenizer
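
The steps of this exercise could be approached along the following lines. This is a sketch under stated assumptions: the checkpoint name `distilbert-base-uncased` is not given in the exercise and is assumed here, and the helpers `get_attentions`, `average_heads`, and `most_attended` are hypothetical names introduced for illustration.

```python
import numpy as np


def get_attentions(sentence, model_name="distilbert-base-uncased"):
    """Hypothetical helper: return tokens and per-layer attention weights."""
    # Imported here so the NumPy helpers below work without these packages.
    import torch
    from transformers import DistilBertModel, DistilBertTokenizer

    tokenizer = DistilBertTokenizer.from_pretrained(model_name)
    # output_attentions=True asks the model to return its attention weights.
    model = DistilBertModel.from_pretrained(model_name, output_attentions=True)
    model.eval()

    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    # outputs.attentions holds one tensor per layer, each shaped
    # (batch, n_heads, seq_len, seq_len); keep the single batch item.
    attentions = [layer[0].numpy() for layer in outputs.attentions]
    return tokens, attentions


def average_heads(layer_attention):
    """Average a (n_heads, seq_len, seq_len) array over its heads."""
    return layer_attention.mean(axis=0)


def most_attended(tokens, weights):
    """Pair each token with the token it attends to most strongly."""
    targets = weights.argmax(axis=-1)
    return [(tok, tokens[int(j)]) for tok, j in zip(tokens, targets)]
```

With `tokens, attentions = get_attentions("your sentence here")`, the matrix `average_heads(attentions[0])` can be drawn as a heat map with the tokens as tick labels on both axes. Repeating this for other entries of `attentions` and comparing the output of `most_attended` gives a starting point for the two questions above.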